Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Expose a Utf8String type. #17872

Closed
wants to merge 58 commits into from
Closed

Expose a Utf8String type. #17872

wants to merge 58 commits into from

Conversation

ghost
Copy link

@ghost ghost commented May 3, 2018

We want to start prototyping Utf8String in CoreFxLab
and for that, we'll need a min-bar System.Utf8String
class exposed from System.Private.CoreLib.

The MethodTable is tricked out exactly as System.String
is except that ComponentSize is 1 rather than 2.

Other than that, this is copy-paste from the
StringObject code with all the necessary offerings
needed to keep the build happy.

We want to start prototyping Utf8String in CoreFxLab
and for that, we'll need a min-bar System.Utf8String
class exposed from System.Private.CoreLib.

The MethodTable is tricked out exactly as System.String
is except that ComponentSize is 1 rather than 2.

Other than that, this is copy-paste from the
StringObject code with all the necessary offerings
needed to keep the build happy.
@ghost ghost self-assigned this May 3, 2018
@ghost ghost requested review from jkotas and GrabYourPitchforks May 3, 2018 14:18
// Do not reorder these fields. Must match layout of Utf8StringObject in object.h.
private int _length;
[CLSCompliant(false)]
public byte _firstByte; // TODO: Is public for experimentation in CoreFxLab. Will be private in its ultimate form.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need this to be public? It would be better to have readonly byte GetPinnableReference() method - that is the proper public method we will want to have eventually.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It allows the code in CoreFxLab to be written as instance members of Utf8String itself would be.

Though if GetPinnableReference() adds no overhead, I'm fine with it. This PR is unblock @GrabYourPitchforks so I'll let him make the call here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, use GetPinnableReference instead if you wish. The only overhead I can think of is that an implicit null check may be emitted, but the JIT is generally good about eliding these where possible. If we find stray null checks we can open JIT bugs.

@jkotas
Copy link
Member

jkotas commented May 3, 2018

What do you expect to get by having this in CoreLib for prototyping?

For prototyping, you can just have a class that wraps byte[] array, no need to have anything in corelib.

@jkotas
Copy link
Member

jkotas commented May 3, 2018

src/vm/ecalllist.h:1391:40: error: use of undeclared identifier 'gUtf8StringFuncs'; did you mean 'gStringFuncs'

@ghost
Copy link
Author

ghost commented May 3, 2018

cc @GrabYourPitchforks

The benefit is to be able to benchmark and demo using an implementation that will have the same memory usage characteristics as what we'd ship.

@stephentoub
Copy link
Member

If it's a prototype, why not keep it in a separate branch rather than master?

@ghost
Copy link
Author

ghost commented May 3, 2018

Is it possible to consume a CoreCLR from a dev branch into CoreFxLab?

@stephentoub
Copy link
Member

Is it possible to consume a CoreCLR from a dev branch into CoreFxLab?

I don't know if that's currently possible, but you can certainly do so locally. My opinion is master should be for stuff we're on track to ship, not prototypes.

@jkotas
Copy link
Member

jkotas commented May 3, 2018

The benefit is to be able to benchmark and demo using an implementation that will have the same memory usage characteristics

You would not be really able use this for bench-marking because of the allocator is slow and the JIT optimizations are missing.

@ghost
Copy link
Author

ghost commented May 3, 2018

Btw, while we're debating where to merge all this in (we need more stakeholders in the room for that, like the people who will be doing the main work on Utf8String), I'd still like to get eyes on the code itself. This GC-related stuff is tricky.

@jkotas
Copy link
Member

jkotas commented May 3, 2018

This GC-related stuff is tricky.

What have you done to test it?


// Compile the string functionality with these pragma flags (equivalent of the command line /Ox flag)
// Compiling this functionality differently gives us significant throughout gain in some cases.
#if defined(_MSC_VER) && defined(_TARGET_X86_)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copy&paste that does not make sense given what's in this file.

// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
//
// File: StringNative.cpp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does not match the actual name. Better to just delete - it is useless comment.

// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
//
// File: StringNative.h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

src/vm/object.h Outdated
// DWORD m_OptionalPadding (this is an optional field and will appear based on need)

public:
VOID SetUtf8StringLength(DWORD len) { LIMITED_METHOD_CONTRACT; _ASSERTE(len >= 0); m_StringLength = len; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be called just SetLength, I think. The managed side has Length property, not Utf8Length property.

@ghost
Copy link
Author

ghost commented May 3, 2018

What have you done to test it?

Stepping through the allocation code, making sure the sizes passed to the GC is ok, tested multiple Utf8String allocations with GC's and random allocations triggered in between, checking to ensure the UTF8String moved in memory but has the same character content.

Make sure MethodTable::IsString still returns true for System.String but not Utf8String.

src/vm/object.h Outdated
// GC will see a Utf8StringObject like this:
// DWORD m_StringLength
// BYTE m_Characters[0]
// DWORD m_OptionalPadding (this is an optional field and will appear based on need)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment about m_OptionalPadding does not make sense to me. Looks like left over from times when System.String was done differently.

@ghost
Copy link
Author

ghost commented May 3, 2018

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

src/vm/vars.hpp Outdated
@@ -315,7 +315,9 @@ class REF : public OBJECTREF
#define ObjectToOBJECTREF(obj) (OBJECTREF(obj))
#define OBJECTREFToObject(objref) ((objref).operator-> ())
#define ObjectToSTRINGREF(obj) (STRINGREF(obj))
#define ObjectToUTF8STRINGREF(obj) (UTF8STRINGREF(obj))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be only needed if we were to have manually managed implementations for UTF8 string. I do not expect we are going to have any. (Except for the fast allocator - that won't need it.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - hoping we won't need this. Ideally we'd be able to keep almost all code related to this type fully managed, with the exception of allocation / GC / p/invoke.

if (cchStringLength > 0x7FFFFFDF)
ThrowOutOfMemory();

SIZE_T ObjectSize = PtrAlign(Utf8StringObject::GetSize(cchStringLength));
Copy link
Member

@jkotas jkotas May 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this allocates more than required.

I have noticed that we may have the same problem in regular string. For example:
new string('a', 5) allocates 0x28 bytes objects on x64 today. But it should be enough for it to allocate just 0x20 bytes:

8 syncblock
8 vtable
4 size
5*2 content
2 zero terminator

Do you know why it is the case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have checked CoreRT. CoreRT allocates 0x20 bytes in this case, so there is definitely something pretty broken.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there some padding between size and content?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is padding for regular arrays (so that all arrays have same layout even when elements are 8-byte aligned). There is no padding for strings. The content starts right after length. The extra unnecessary bytes are at the end.

I remember folks went to a great length to ensure that strings do not pay the extra 4 bytes during the initial 64-bit ports. It looks like it has regressed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's coming from sizeof(Utf8StringObject) - there should be a pragma pack 4 around that declaration.

Presumably the same for StringObject, though I want to keep anything like that separate from this PR. We haven't yet agreed to merge this into master.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#17876 has the fix for the String overallocation problem.

namespace System
{
// This is an experimental type and not referenced from CoreFx but needs to exists and be public so we can prototype in CoreFxLab.
public sealed class Utf8String
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we add interfaces (IEnumerable, IComparable, ...) to this during development, will we need to touch the VM code at all? That is, does the VM need to know about every possible member that sits on this type?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily about interfaces. But the VM and JIT will certainly need to know more about this type to get decent performance. It goes back to my question on what you expect to get by adding this stub to CoreLib.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas You're right in that we're not going to see full performance benchmarks until JIT support comes online, but this does get us at least partway there. Even when not JIT-optimized, the supplied allocation routine will still be faster and allocate less overall heap memory than calling two different ctors for a container + separate array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could just be a struct wrapper with array as its single field?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benaadams We considered this for prototype purposes. But the end result is to have the final type be a class, and any data we'd collect in the meantime from a struct stand-in would be suspect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the supplied allocation routine will still be faster

That is not correct. The allocation routine in this PR is super slow naive implementation. I just tried:

class MyUtf8String
{
   byte[] _payload;

   static public MyUtf8String FastAllocate(int length) { var ret = new MyUtf8String(); ret._payload = new byte[length+1]; return ret; }
}

Run Utf8String.FastAllocate vs. MyUtf8String.FastAllocate in a loop. MyUtf8String loop is faster.

I am with Stephen that it would be best for a rough early prototype code like this to stay in a branch and not be merged into master.

FWIW, Span<T> was also developed in a branch and only integrated to master once it was sufficiently along and we agreed to ship it for real: #7886.


orObject = (Utf8StringObject *)Alloc(ObjectSize, FALSE, FALSE);

// Object is zero-init already
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This includes the null terminator being zero-inited, I suppose?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

DWORD baseSize = ObjSizeOf(Utf8StringObject) + sizeof(BYTE);
pMT->SetBaseSize(baseSize); // NULL character included

GetHalfBakedClass()->SetBaseSizePadding(baseSize - bmtFP->NumInstanceFieldBytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetHalfBakedClass is this a real method 🤣

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

method function (it's c++) 😉

@stephentoub
Copy link
Member

Is it possible to consume a CoreCLR from a dev branch into CoreFxLab?

I seem to remember we had things previously set up to be able to do this. @mmitche, @weshaggard, had you helped with that, publishing NuGet packages from another coreclr branch so they could be consumed into a dev branch in either corefx or corefxlab?

@ghost
Copy link
Author

ghost commented Jun 8, 2018

@dotnet-bot test CentOS7.1 x64 Debug Innerloop Build

@ghost
Copy link
Author

ghost commented Jun 8, 2018

@dotnet-bot test CentOS7.1 x64 Checked Innerloop Build and Test
@dotnet-bot test CentOS7.1 x64 Debug Innerloop Build

@ghost
Copy link
Author

ghost commented Jun 12, 2018

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

@ghost
Copy link
Author

ghost commented Jun 25, 2018

@dotnet-bot test Linux-musl x64 Debug Build

@ghost
Copy link
Author

ghost commented Jun 28, 2018

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

@ghost
Copy link
Author

ghost commented Jun 29, 2018

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

@ghost
Copy link
Author

ghost commented Jul 9, 2018

Will target feature branch. Opening a new PR for that.

@ghost ghost closed this Jul 9, 2018
@ghost ghost deleted the utf8 branch July 11, 2018 14:14
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.